Automatic Analysis and indexing of variable-layout documents

نویسندگان

  • Enrico Appiani
  • Anna Maria Colla
چکیده

In this paper a methodology for analysis and automatic indexing of imaged documents within an archiving and retrieval system is described. This system, which is being developed within the Esprit project STRETCH (STorage and RETrieval by Content of imaged documents), is based on a new generation Archiving and Retrieval Engine (ARE), which overcomes the bottleneck of document profiling by alleviating the existing limitations of pre-defined indexing schemes. The ARE exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The objective of this system is twofold. First, it aims at combining direct digitalisation, based on location of information fields and OCR, with advanced techniques derived from Image Analysis and Pattern Recognition. Second, it aims at offering ease of use and programming and ability to dynamically adapt to new types of documents. After experiments in some other document domains (i.e. invoices, medical images, circular letters), the system has been tested on bank documents. In this application, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic indexing of scanned documents: a layout-based approach

Archiving official written documents such as invoices, reminders and account statements in business and private area gets more and more important. Creating appropriate index entries for document archives like sender’s name, creation date or document number is a tedious manual work. We present a novel approach to handle automatic indexing of documents based on generic positional extraction of

متن کامل

Segmentation of Color Documents by Line Oriented Clustering using Spatial Information

In this contribution we introduce a new method for global segmentation of color documents with a structure based on text frames and pictures. It is based on an extensive analysis of the expected shape of clusters in RGB-color space. The method provides an improved segmentation, and gives a proper basis for indexing and layout analysis. Results are very promising. keywords: color documents, docu...

متن کامل

Extensible layout in functional documents

Highly customised variable-data documents make automatic layout of the resulting publication hard. Architectures for defining and processing such documents can benefit if the repertoire of layout methods available can be extended smoothly and easily to accommodate new styles of customisation. The Document Description Framework incorporates a model for declarative document layout and processing ...

متن کامل

A Semi-Automatic Approach of old Arabic Documents Indexing

indexing is a largely used technique in retrieval systems. It has as goal to extract and to represent the meaning of a document so that it can be found by the user. We can cite two types of indexing: manual indexing, and automatic indexing. The automatic indexing requires to use character and words recognition engines which work only over the texts of contemporary documents. In this paper, we p...

متن کامل

مدل دو مرحله ای شکاف- گلچین برای نمایه سازی خودکار متون فارسی

Purpose: Each language has its own problems. This leads to consider appropriate models for automatic indexing of every language. These models should concern the exhaustificity and specificity of indexing.   This paper aims at introduction and evaluation of a model which is suited for Persian automatic indexing. This model suggests to break the text into the particles of candidate terms and to c...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000